Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts

Authors

  • Sven Rahmann
  • Eric Rivals
Abstract

The number of missing words (NMW) of length q in a text, and the number of common words (NCW) of two texts are useful text statistics. Knowing the distribution of the NMW in a random text is essential for the construction of so-called monkey tests for pseudorandom number generators. Knowledge of the distribution of the NCW of two independent random texts is useful for the average case analysis of a family of fast pattern matching algorithms, namely those which use a technique called q-gram filtration. Despite these important applications, we are not aware of any exact studies of these text statistics. We propose an efficient method to compute their expected values exactly. The difficulty of the computation lies in the strong dependence of successive words, as they overlap by (q−1) characters. Our method is based on the enumeration of all string autocorrelations of length q, i.e., of the ways a word of length q can overlap itself. For this, we present the first efficient algorithm. Furthermore, by assuming the words are independent, we obtain very simple approximation formulas, which are shown to be surprisingly good when compared to the exact values.
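The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the enumeration below is plain brute force over all words (feasible only for small q), and the approximation assumes a text of length n with letters drawn uniformly and independently from an alphabet of size sigma. Under the independence assumption, a fixed word is absent from all n − q + 1 positions with probability (1 − sigma^(−q))^(n − q + 1), and multiplying by the sigma^q possible words gives the expected count. The function names are hypothetical and chosen for illustration.

```python
from itertools import product


def autocorrelation(word):
    """Return the autocorrelation of `word` as a list of 0/1 flags.

    Flag i is 1 when the word shifted right by i positions agrees with
    itself on the overlap, i.e. word[i:] == word[:len(word) - i].
    """
    q = len(word)
    return [1 if word[i:] == word[:q - i] else 0 for i in range(q)]


def distinct_autocorrelations(alphabet, q):
    """Brute-force enumeration over all len(alphabet)**q words.

    Illustrative only: this is exponential in q and is NOT the
    efficient enumeration algorithm the abstract refers to.
    """
    return {tuple(autocorrelation(w)) for w in product(alphabet, repeat=q)}


def approx_expected_missing(sigma, q, n):
    """Independence approximation of the expected number of missing q-grams.

    Assumes a uniform i.i.d. text of length n over sigma letters and
    treats the n - q + 1 overlapping words as if they were independent.
    """
    if n < q:
        # The text contains no complete q-gram, so every word is missing.
        return float(sigma ** q)
    return sigma ** q * (1.0 - sigma ** -q) ** (n - q + 1)


if __name__ == "__main__":
    print(autocorrelation("abab"))
    print(sorted(distinct_autocorrelations("ab", 3)))
    print(approx_expected_missing(4, 5, 2000))
```

For example, `autocorrelation("abab")` flags shifts 0 and 2, because "abab" overlaps itself when shifted by two characters. Words with the same autocorrelation behave identically with respect to self-overlap, which is why enumerating the distinct autocorrelations, rather than all words, is the natural unit of work for an exact computation.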


Similar resources

The Expected Number of Missing Words in a Random Text

The number of words of length q (q-grams) in a text or the number of common q-grams between two texts are used in pattern matching algorithms, as distance measure between texts or for testing the randomness of pseudorandom number generators. Despite this broad range of applications, no exact systematic statistical studies of those numbers can be found in the literature. We propose an algorithm ...


An Efficient Protocol for Secure Multiparty Summation with Repeatability

In secure multiparty computation (SMC), a group of users jointly and securely computes a mathematical function on their private inputs such that the privacy of those inputs is preserved. One widely used application of SMC is secure multiparty summation, which securely computes the sum of the users’ private inputs. In this paper, we consider a secure multipar...


An Improvement in Support Vector Machines Algorithm with Imperialism Competitive Algorithm for Text Documents Classification

Due to the exponential growth of electronic texts, their organization and management require a tool that delivers the information users search for in the shortest possible time. Thus, classification methods have become very important in recent years. In natural language processing, and especially text processing, one of the most basic tasks is automatic text classification. Moreover, text ...


Reliable location-allocation model for congested systems under disruptions using accelerated Benders decomposition

This paper proposes a reliable location-allocation model in which facilities are subject to the risk of disruptions. Since service facilities are expected to satisfy random and heavy demands, we model congestion in the system within a queuing framework that handles two sources of uncertainty, associated with demand and service. To ensure the service quality, a minimum limit re...


Deriving the Exact Cost Function for a Two-Level Inventory System with Information Sharing

In this paper we consider a two-level inventory system with one warehouse and one retailer with information exchange. Transportation times are constant and the retailer faces independent Poisson demand. The retailer applies a continuous-review (R,Q) policy. The supplier starts with m initial batches (of size Q), and places an order to an outside source immediately after the retailer’s inventory posit...




Publication date: 2000